import pandas as pd
from GenderPayGap import *
# Load the sample salary data (one row per employee)
df = pd.read_excel('sample_salary.xls')
df.head()
| | SCALE | TENURE | ZONE | CENTER | GENDER | TYPE | SALARY |
|---|---|---|---|---|---|---|---|
| 0 | 145 | 1 | WEST | Emp: 36 to 50 | FEMALE | FULL-TIME | 17515 |
| 1 | 145 | 1 | WEST | Emp: 36 to 50 | FEMALE | FULL-TIME | 16031 |
| 2 | 145 | 11 | WEST | Emp: 10 to 20 | FEMALE | FULL-TIME | 21633 |
| 3 | 145 | 9 | WEST | Emp: 10 to 20 | FEMALE | FULL-TIME | 20001 |
| 4 | 145 | 2 | WEST | Emp: 10 to 20 | FEMALE | FULL-TIME | 19623 |
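# Shift SCALE so that the lowest grade in the data becomes 0
df['SCALE'] = df['SCALE'] - df['SCALE'].min()
# Build the analysis object: GENDER is the group column, SALARY the target
example = GenderPayGap(df, 'GENDER', 'SALARY', swap=True)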
| GENDER | MALE | FEMALE | RawGAP | %RawGAP |
|---|---|---|---|---|
| SALARY | 30,802.15 | 22,748.05 | 8,054.10 | 26.15 |
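The raw gap in the table above can be checked by hand. The following minimal sketch assumes, as the figures suggest, that `RawGAP` is the difference between the two group means and that `%RawGAP` expresses it relative to the male mean; the variable names are illustrative only and are not part of the `GenderPayGap` class.

```python
# Group means copied from the summary table above
male_mean = 30802.15
female_mean = 22748.05

# Raw gap: difference between the group means
raw_gap = male_mean - female_mean

# Percentage gap, assumed here to be relative to the male mean
raw_gap_pct = 100 * raw_gap / male_mean

print(f"RawGAP  = {raw_gap:,.2f}")
print(f"%RawGAP = {raw_gap_pct:.2f}")
```

Both values agree with the table to two decimal places.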
example.exploratory_data_analysis()
SCALE Skew : 2.88
| GENDER | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| FEMALE | 152.00 | 41.78 | 34.95 | 0.00 | 0.00 | 55.00 | 70.00 | 111.00 |
| MALE | 538.00 | 65.63 | 38.02 | 0.00 | 55.00 | 55.00 | 55.00 | 399.00 |
TENURE Skew : 1.28
| GENDER | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| FEMALE | 152.00 | 5.54 | 5.37 | 1.00 | 2.00 | 4.00 | 7.00 | 41.00 |
| MALE | 538.00 | 12.01 | 9.21 | 1.00 | 4.00 | 10.00 | 18.00 | 49.00 |
SALARY Skew : 4.0
| GENDER | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| FEMALE | 152.00 | 22,748.05 | 6,621.03 | 12,025.00 | 17,987.50 | 21,979.00 | 26,072.75 | 51,186.00 |
| MALE | 538.00 | 30,802.15 | 13,832.55 | 12,960.00 | 24,311.00 | 27,431.00 | 32,835.25 | 162,953.00 |
| | count | unique | top | freq |
|---|---|---|---|---|
| ZONE | 690 | 5 | NORTH | 257 |
| CENTER | 690 | 5 | Emp: >50 | 268 |
| GENDER | 690 | 2 | MALE | 538 |
| TYPE | 690 | 2 | FULL-TIME | 672 |
Show polynomial plots for numerical variables
example.prepare_data(max_unique=45,column_to_exp='SCALE', exponent=2, drop_original=True)
Identified columns to encode... ['ZONE', 'CENTER', 'GENDER', 'TYPE']
Included for encoding.......... ZONE
Included for encoding.......... CENTER
Included for encoding.......... TYPE
New column added............... SCALE**2 ( SCALE raised to the power of 2 )
Original column droped........... SCALE
New dataframe total columns.... 13
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| TENURE | 1 | 1 | 11 | 9 | 2 |
| GENDER_FEMALE | 1 | 1 | 1 | 1 | 1 |
| SALARY | 17515 | 16031 | 21633 | 20001 | 19623 |
| ZONE_EAST | 0 | 0 | 0 | 0 | 0 |
| ZONE_NORTH | 0 | 0 | 0 | 0 | 0 |
| ZONE_SOUTH | 0 | 0 | 0 | 0 | 0 |
| ZONE_WEST | 1 | 1 | 1 | 1 | 1 |
| CENTER_Emp: 21 to 35 | 0 | 0 | 0 | 0 | 0 |
| CENTER_Emp: 36 to 50 | 1 | 1 | 0 | 0 | 0 |
| CENTER_Emp: <10 | 0 | 0 | 0 | 0 | 0 |
| CENTER_Emp: >50 | 0 | 0 | 0 | 0 | 0 |
| TYPE_PART-TIME | 0 | 0 | 0 | 0 | 0 |
| SCALE**2 | 0 | 0 | 0 | 0 | 0 |
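The transposed preview above shows what the preparation step produced: each categorical variable becomes a set of 0/1 columns with one level left out (ZONE and CENTER have 5 levels but 4 columns each), GENDER becomes a single `GENDER_FEMALE` indicator, and `SCALE` is replaced by its square. A rough equivalent in plain pandas might look like the sketch below. This is not the code of `prepare_data`; the toy data and the choice of `drop_first=True` are assumptions made only to illustrate the idea.

```python
import pandas as pd

# Toy data with the same column roles as the sample file (values are illustrative)
toy = pd.DataFrame({
    "SCALE": [0, 55, 70],
    "TENURE": [1, 11, 9],
    "ZONE": ["WEST", "NORTH", "SOUTH"],
    "GENDER": ["FEMALE", "MALE", "FEMALE"],
    "SALARY": [17515, 27431, 21633],
})

# One-hot encode ZONE, dropping the first level so it acts as the reference category
encoded = pd.get_dummies(toy, columns=["ZONE"], drop_first=True, dtype=int)

# Encode the group variable as a single indicator column
encoded["GENDER_FEMALE"] = (encoded.pop("GENDER") == "FEMALE").astype(int)

# Add the squared term and drop the original column (exponent=2, drop_original=True)
encoded["SCALE**2"] = encoded.pop("SCALE") ** 2

print(encoded.T)
```

Leaving one level out of each categorical variable avoids perfect collinearity with the constant term that is added later for the regression.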
example.select_significant()
Initial columns.................................... 13
Constant column added for Ordinary Least Squares regression
Adjusted r-square with original variables ......... 0.808176844474361
Variables to drop ( "p-value" > 0.05 )............. 5
Variables dropped:................................. ['ZONE_EAST', 'ZONE_NORTH', 'CENTER_Emp: 36 to 50', 'CENTER_Emp: <10', 'CENTER_Emp: >50']
Adjuster r-square with significant variables....... 0.8036018924535696
Final variables considered......................... 9
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| const | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| TENURE | 1.00 | 1.00 | 11.00 | 9.00 | 2.00 |
| GENDER_FEMALE | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| SALARY | 17,515.00 | 16,031.00 | 21,633.00 | 20,001.00 | 19,623.00 |
| ZONE_SOUTH | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| ZONE_WEST | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| CENTER_Emp: 21 to 35 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| TYPE_PART-TIME | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SCALE**2 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
example.plot_coefficients()
example.avg_decomposition(width=None, height=None)
Salary decomposition with significant p-values (> 0.05 )
No width and height specified
| | Value_STD | Value_MIN | Value_MAX | Value_MEAN | Coefficients | Salary_STD | Salary_MEAN |
|---|---|---|---|---|---|---|---|
| const | 0.00 | 1.00 | 1.00 | 1.00 | 21,146.06 | 0.00 | 21,146.06 |
| SCALE**2 | 10,063.02 | 0.00 | 159,201.00 | 5,134.84 | 0.92 | 9,216.49 | 4,702.88 |
| TENURE | 8.93 | 1.00 | 49.00 | 10.58 | 359.36 | 3,208.27 | 3,802.92 |
| ZONE_SOUTH | 0.33 | 0.00 | 1.00 | 0.12 | 8,401.30 | 2,749.08 | 1,022.77 |
| TYPE_PART-TIME | 0.16 | 0.00 | 1.00 | 0.03 | -3,294.54 | -525.51 | -85.94 |
| GENDER_FEMALE | 0.41 | 0.00 | 1.00 | 0.22 | -2,233.70 | -926.41 | -492.06 |
| ZONE_WEST | 0.43 | 0.00 | 1.00 | 0.24 | -2,152.28 | -914.88 | -508.44 |
| CENTER_Emp: 21 to 35 | 0.39 | 0.00 | 1.00 | 0.19 | -2,928.69 | -1,152.77 | -560.27 |
| Total | 10,073.67 | 2.00 | 159,256.00 | 5,147.21 | 19,298.42 | 11,654.28 | 29,027.91 |
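One way to read the `Salary_MEAN` column is as each variable's contribution to the average salary: its coefficient times its mean value. Because the model includes a constant, these contributions should add up to the overall mean salary. The sketch below checks this using only figures printed in this notebook (the group counts and means come from the exploratory output); the variable names are illustrative.

```python
# Salary_MEAN contributions copied from the table above
contributions = {
    "const": 21146.06,
    "SCALE**2": 4702.88,
    "TENURE": 3802.92,
    "ZONE_SOUTH": 1022.77,
    "TYPE_PART-TIME": -85.94,
    "GENDER_FEMALE": -492.06,
    "ZONE_WEST": -508.44,
    "CENTER_Emp: 21 to 35": -560.27,
}
total = sum(contributions.values())

# Overall mean salary rebuilt from the per-gender counts and means
n_male, n_female = 538, 152
overall_mean = (n_male * 30802.15 + n_female * 22748.05) / (n_male + n_female)

print(f"Sum of contributions: {total:,.2f}")
print(f"Overall mean salary:  {overall_mean:,.2f}")
```

The two numbers differ only by rounding in the printed table, which matches the `Total` row of `Salary_MEAN`.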
example.gap_decomposition(width=None, height=None)
Salary decomposition with significant p-values (> 0.05 )
No width and height specified
| | MALE | FEMALE | Value_GAP | Coefficients | MALE_PAY | FEMALE_PAY | Salary_GAP | Percentage_GAP |
|---|---|---|---|---|---|---|---|---|
| SCALE**2 | 5,749.64 | 2,958.75 | 2,790.89 | 0.92 | 5,265.97 | 2,709.85 | 2,556.11 | 48.54 |
| TENURE | 12.01 | 5.54 | 6.47 | 359.36 | 4,314.94 | 1,990.64 | 2,324.30 | 53.87 |
| GENDER_FEMALE | 0.00 | 1.00 | NaN | -2,233.70 | -0.00 | -2,233.70 | 2,233.70 | -inf |
| ZONE_SOUTH | 0.14 | 0.04 | 0.11 | 8,401.30 | 1,218.03 | 331.63 | 886.40 | 72.77 |
| TYPE_PART-TIME | 0.01 | 0.07 | -0.06 | -3,294.54 | -42.87 | -238.42 | 195.55 | -456.15 |
| const | 1.00 | 1.00 | 0.00 | 21,146.06 | 21,146.06 | 21,146.06 | 0.00 | 0.00 |
| ZONE_WEST | 0.24 | 0.23 | 0.01 | -2,152.28 | -512.07 | -495.59 | -16.48 | 3.22 |
| CENTER_Emp: 21 to 35 | 0.20 | 0.16 | 0.04 | -2,928.69 | -587.92 | -462.42 | -125.49 | 21.34 |
| Total | 5,763.25 | 2,966.79 | 2,797.46 | 19,298.42 | 30,802.15 | 22,748.05 | 8,054.10 | 26.15 |
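The per-variable columns in the table above appear to follow a simple pattern: `Salary_GAP` is `MALE_PAY` minus `FEMALE_PAY`, and `Percentage_GAP` is that gap as a percentage of `MALE_PAY`. The sketch below reproduces three rows under that assumption. The helper `pct_gap` is hypothetical and is not part of the library; it returns NaN when `MALE_PAY` is zero, which is the case that yields `-inf` in the `GENDER_FEMALE` row.

```python
import math

# (MALE_PAY, FEMALE_PAY) copied from three rows of the table above
rows = {
    "SCALE**2": (5265.97, 2709.85),
    "TENURE": (4314.94, 1990.64),
    "ZONE_SOUTH": (1218.03, 331.63),
}

def pct_gap(male_pay: float, female_pay: float) -> float:
    """Gap as a percentage of male pay; NaN when male pay is zero (hypothetical helper)."""
    if male_pay == 0:
        return math.nan
    return 100 * (male_pay - female_pay) / male_pay

# Map each variable to (Salary_GAP, Percentage_GAP)
results = {
    name: (male - female, pct_gap(male, female))
    for name, (male, female) in rows.items()
}

for name, (gap, pct) in results.items():
    print(f"{name:12s} gap={gap:10,.2f}  pct={pct:6.2f}")
```

Small differences in the last decimal place against the table are expected, since the inputs used here are already rounded.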
example.gap_summary()
Adjusted r-square....... 0.8036018924535696
| GENDER | MALE | FEMALE | RawGAP | %RawGAP | AdjustedGAP | % AdjustedGAP |
|---|---|---|---|---|---|---|
| SALARY | 30,802.15 | 22,748.05 | 8,054.10 | 26.15 | 2,233.70 | 7.25 |
| | OaxacaB_Two-Fold |
|---|---|
| MALE_PAY | 30,802.15 |
| FEMALE_PAY | 22,748.05 |
| RawGAP | 8,054.10 |
| FEMALE_PAY_Predicted | 24,981.75 |
| Explained_GAP | 5,820.40 |
| Unexplained_GAP | 2,233.70 |
| R-square_adj | 0.80 |
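The figures in the two summary tables are consistent with a two-fold split in which the unexplained part equals the magnitude of the `GENDER_FEMALE` coefficient from the reduced model, and the explained part is the remainder of the raw gap. The sketch below reproduces the numbers under that reading. It is an interpretation of the printed output, not the library's code, and the variable names are illustrative.

```python
# Values copied from the tables above
male_pay = 30802.15
female_pay = 22748.05
gender_coef = -2233.70  # GENDER_FEMALE coefficient in the reduced model

# Unexplained gap: the pay difference attributed to gender alone
unexplained = -gender_coef

# Female pay predicted with the gender penalty removed (assumed meaning of FEMALE_PAY_Predicted)
female_pay_predicted = female_pay + unexplained

# Explained gap: the part of the raw gap accounted for by the other variables
explained = male_pay - female_pay_predicted

# Adjusted gap as a percentage of male pay
adjusted_gap_pct = 100 * unexplained / male_pay

print(f"FEMALE_PAY_Predicted = {female_pay_predicted:,.2f}")
print(f"Explained_GAP        = {explained:,.2f}")
print(f"Unexplained_GAP      = {unexplained:,.2f}")
print(f"% AdjustedGAP        = {adjusted_gap_pct:.2f}")
```

By construction the explained and unexplained parts sum to the raw gap of 8,054.10.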
example.ols_first.summary2()
| Model: | OLS | Adj. R-squared: | 0.808 |
| Dependent Variable: | SALARY | AIC: | 13907.6682 |
| Date: | 2022-09-08 17:20 | BIC: | 13966.6452 |
| No. Observations: | 690 | Log-Likelihood: | -6940.8 |
| Df Model: | 12 | F-statistic: | 242.9 |
| Df Residuals: | 677 | Prob (F-statistic): | 6.53e-236 |
| R-squared: | 0.812 | Scale: | 3.2590e+07 |
| | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 23220.1277 | 1252.0285 | 18.5460 | 0.0000 | 20761.8021 | 25678.4533 |
| TENURE | 362.3815 | 26.9964 | 13.4233 | 0.0000 | 309.3747 | 415.3883 |
| GENDER_FEMALE | -1976.0477 | 581.3050 | -3.3993 | 0.0007 | -3117.4251 | -834.6702 |
| ZONE_EAST | -181.8904 | 1551.2432 | -0.1173 | 0.9067 | -3227.7165 | 2863.9356 |
| ZONE_NORTH | -2137.6259 | 1503.3425 | -1.4219 | 0.1555 | -5089.4002 | 814.1484 |
| ZONE_SOUTH | 6823.4378 | 1570.0954 | 4.3459 | 0.0000 | 3740.5960 | 9906.2797 |
| ZONE_WEST | -3536.9049 | 1490.4893 | -2.3730 | 0.0179 | -6463.4421 | -610.3676 |
| CENTER_Emp: 21 to 35 | -3752.1336 | 962.0835 | -3.9000 | 0.0001 | -5641.1598 | -1863.1073 |
| CENTER_Emp: 36 to 50 | -1831.6376 | 957.1823 | -1.9136 | 0.0561 | -3711.0403 | 47.7652 |
| CENTER_Emp: <10 | -277.9501 | 1065.7298 | -0.2608 | 0.7943 | -2370.4831 | 1814.5828 |
| CENTER_Emp: >50 | -666.2423 | 908.5699 | -0.7333 | 0.4636 | -2450.1959 | 1117.7113 |
| TYPE_PART-TIME | -2872.4677 | 1417.3445 | -2.0267 | 0.0431 | -5655.3872 | -89.5482 |
| SCALE**2 | 0.9195 | 0.0233 | 39.4652 | 0.0000 | 0.8738 | 0.9653 |
| Omnibus: | 123.426 | Durbin-Watson: | 1.328 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 1908.281 |
| Skew: | -0.234 | Prob(JB): | 0.000 |
| Kurtosis: | 11.134 | Condition No.: | 167575 |
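The coefficient table of this first model explains which variables `select_significant()` reported as dropped: they are the ones whose p-value exceeds 0.05. The sketch below applies that threshold to the `P>|t|` column copied from the table. It only illustrates the filtering rule stated in the log output and does not call the library.

```python
# P>|t| values copied from the first OLS summary above
p_values = {
    "const": 0.0000,
    "TENURE": 0.0000,
    "GENDER_FEMALE": 0.0007,
    "ZONE_EAST": 0.9067,
    "ZONE_NORTH": 0.1555,
    "ZONE_SOUTH": 0.0000,
    "ZONE_WEST": 0.0179,
    "CENTER_Emp: 21 to 35": 0.0001,
    "CENTER_Emp: 36 to 50": 0.0561,
    "CENTER_Emp: <10": 0.7943,
    "CENTER_Emp: >50": 0.4636,
    "TYPE_PART-TIME": 0.0431,
    "SCALE**2": 0.0000,
}

ALPHA = 0.05  # significance threshold named in the log output

# Variables whose p-value exceeds the threshold are dropped
dropped = [name for name, p in p_values.items() if p > ALPHA]
kept = [name for name in p_values if name not in dropped]

print("Dropped:", dropped)
print("Kept regressors:", len(kept))
```

The eight kept regressors plus the `SALARY` column give the 9 final variables reported earlier. The reduced model is then refit, which is why its `GENDER_FEMALE` coefficient (-2,233.70) differs from the -1,976.05 shown in this first summary.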